Introduction

Cancer is a term for uncontrolled proliferation of cells in our body. At the heart of uncontrolled cell proliferation are cellular signaling events. Virtually every aspect of the cell is regulated by post-translational modifications (PTMs). The most extensively-studied PTM is phosphorylation. The phosphorylation state of the proteome differs significantly between normal and cancer cells (Evan and Vousden 2001, Easty and Bennett (2000), Blume-Jensen and Hunter (2001)). Those phosphorylations are being carried out by protein kianses. Each kinase has

Our laboratory has developed a precision technique that measures the core substrate specificities of individual protein kinases with high reproducibility (Turk et al. 2001, Hutti et al. (2004)). This assay focuses on the 4-6 amino acids neighboring the serine, threonine or tyrosine of interest, a region kinases are exquisitely selective of. The probability matrix describing the favorability for each amino-acid in every position called the substrate specificy or the motif of the kinase.

Different kinases has different substrate specificity (motif) for the sequence around the central phospho-acceptor in order to perform phosphorylation. In our lab we have characterized the motifs for the entire Tyrosine kinome (80 kinases). From this profiling, we noticed that evolutionary-closed kinases from the same family have similar motifs. Therefore, the immidiate question is - why there are kinases with very similar motifs?

Our hypothesis is that the different kinases are expressed in different tissues, and in most of the time are mutually exclusive. An initial analysis of scRNA-seq of mice organs (Tabula Muris (Consortium and others 2018)) showed evidences supporting this hypothesis.

However, there are two major problems with the Tabula Muris database: 1. It is single-cell RNA-seq, which is still relatively new field and lack of consensus regarding the analysis methods which should be used for it. 2. Even though there is a high correlation between mice and human tissues, it is better to test our hypothesis in human tissues.

Hence, I believe that NGS data, specifically bulk RNA-seq from different human tissues, can be very valuable in order to test this hypothesis. Practically, I am planning to compare the expression levels of the kinases in the Tyrosine kinome for different Tyrosine kinases’ families, and to check whether there is a mutually exclusive expression pattern or not.

In that project, I am testing our hypothesis for 7 different kinases families (DDR, SrcA, SrcB, Type III RTK, TAM, INSR/IGF1R and Eph) tyrosine kinases, which each one of the kinases within each family has similar substrate spcificity. To that end, I will examine 6 different tissues - bladder, blood, brain, liver, lung, skin and thyroid, whereas for each tissue I will be use 5 samples (except for bladder in which I will use 8 samples). I will process and analyze the RNA-seq data, extract the differential expression of my specific genes in different tissue, and test our hypothesis based on those results.

Database

For targeting this research question, I am planning to use The Genotype-Tissue Expression (GTEx) database (https://gtexportal.org/home/), which is a portal for RNA-seq of human tissues of postmortem donors. This database contains 53 different tissues from 714 different donors, and 11688 samples in total. most of the samples (10361) have also genotype information. The raw data (fasta files) are accessible upon request.

For the project I chose 7 different tissues (bladder, blood, brain, liver, lung, skin and thyroid), 8 samples from bladder and 5 samples from the rest of the tissues:

library(knitr)
library(kableExtra)

samples <- read.table("./databases/project_samples.txt", sep = '\t', header = TRUE)
kable(samples) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
Run SRA_Sample Sample_Name histological_type is_tumor sex SRA_Study
SRR1071717 SRS524326 GTEX-U3ZM-0826-SM-4DXU6 Bladder No male SRP012682
SRR1079830 SRS524681 GTEX-U4B1-1226-SM-4DXT7 Bladder No male SRP012682
SRR1081765 SRS524765 GTEX-S4Q7-0926-SM-4AD5D Bladder No male SRP012682
SRR1084917 SRS524903 GTEX-SNOS-0526-SM-4DM54 Bladder No male SRP012682
SRR1085402 SRS524923 GTEX-SNMC-0826-SM-4DM66 Bladder No male SRP012682
SRR1086236 SRS524959 GTEX-TMMY-1526-SM-4DXST Bladder No female SRP012682
SRR1099957 SRS526556 GTEX-S3XE-1226-SM-4AD4L Bladder No male SRP012682
SRR1120296 SRS534674 GTEX-U3ZN-1226-SM-4DXUD Bladder No female SRP012682
SRR613723 SRS374901 GTEX-R55F-0005-SM-2TF4W Blood No male SRP012682
SRR614059 SRS374925 GTEX-PWN1-0006-SM-2S1NG Blood No female SRP012682
SRR614131 SRS374918 GTEX-RTLS-0006-SM-2TF58 Blood No female SRP012682
SRR614167 SRS374895 GTEX-RM2N-0006-SM-2TF5H Blood No male SRP012682
SRR615177 SRS374896 GTEX-QDVJ-0005-SM-2TC5X Blood No male SRP012682
SRR595926 SRS332984 GTEX-N7MS-0011-R10A-SM-2HMJK Brain No male SRP012682
SRR598253 SRS332955 GTEX-Q2AG-0011-R6A-SM-2HML7 Brain No female SRP012682
SRR603397 SRS332972 GTEX-NPJ7-0011-R2a-SM-2I3GF Brain No female SRP012682
SRR627449 SRS377771 GTEX-NPJ8-1526-SM-26GMY Brain No male SRP012682
SRR627462 SRS377775 GTEX-NPJ8-2626-SM-26GMZ Brain No male SRP012682
SRR1093861 SRS525377 GTEX-XXEK-1126-SM-4BRUX Liver No male SRP012682
SRR1095383 SRS525982 GTEX-QDVN-0826-SM-48TZ2 Liver No male SRP012682
SRR659649 SRS389590 GTEX-U3ZN-0226-SM-3DB8D Liver No female SRP012682
SRR807971 SRS408127 GTEX-OOBJ-0826-SM-3NB2K Liver No male SRP012682
SRR815711 SRS408539 GTEX-P78B-1326-SM-3P611 Liver No female SRP012682
SRR600417 SRS333244 GTEX-Q2AI-0526-SM-2I3EJ Lung No male SRP012682
SRR600584 SRS333222 GTEX-OHPL-0526-SM-2HMIX Lung No female SRP012682
SRR600900 SRS333156 GTEX-PX3G-0526-SM-2I3EM Lung No female SRP012682
SRR601900 SRS333310 GTEX-OHPJ-1026-SM-2HMLE Lung No male SRP012682
SRR665217 SRS389763 GTEX-U3ZH-0526-SM-3DB75 Lung No male SRP012682
SRR614206 SRS374955 GTEX-NPJ8-0126-SM-2YUNR Skin No male SRP012682
SRR614755 SRS374966 GTEX-OXRO-0126-SM-2YUN4 Skin No female SRP012682
SRR614779 SRS374973 GTEX-PWCY-1826-SM-2S1OK Skin No female SRP012682
SRR615008 SRS374950 GTEX-PLZ5-2026-SM-2S1O4 Skin No male SRP012682
SRR615623 SRS374957 GTEX-OXRL-0126-SM-2YUMP Skin No male SRP012682
SRR655277 SRS389076 GTEX-WFG8-0426-SM-3GILD Thyroid No male SRP012682
SRR656191 SRS389105 GTEX-S7SE-0726-SM-2XCD7 Thyroid No male SRP012682
SRR656373 SRS389163 GTEX-WEY5-0526-SM-3GIKZ Thyroid No female SRP012682
SRR656469 SRS389159 GTEX-OXRP-0326-SM-33HBJ Thyroid No female SRP012682
SRR656528 SRS389160 GTEX-POYW-0826-SM-2XCEM Thyroid No male SRP012682

Results

The analysis had supported our hypothesis, and showed that when there are few kinses with very similar substrate specificity, they usually will be expressed in different tissues.

For the SrcA family (FGR, FYN, SRC and YES1), we can see that their substrate specificities are very similar to each other:

According to the Tabula Muris database, those kinases are expressed in different tissues. One of the the differences is that FYN and YES1 are expressed in the liver, whereas SRC and FGR are not:

and indeed, when we test the GTEx database, we can see that FYN and YES1 are higher expressed in liver than FGR and SRC, and that in general those kinases are usually expressed mutually exclusively in different tissues:

We can observe the same conclusion for the SrcB family (BLK, HCK, LCK and LYN). Also here, their substrate specificties matrices are very similar:

and according to Tabula Muris, HCK and LYN are highly expressed in brain tissues, whereas BLK and LCK are not:

According to the analysis, also in human bulk RNA-seq, HCK and LYN are much higher expressed in the brain than BLK and LCK:

and also in that case, the kinases are usually expressed in different tissues for one another:

Future analysis

In order to reinforce the evidences supporting our hypothesis, we should perform a complete analysis of all the tissues available on GTEx, and using all the samples from each tissue. By that, we will get a deeper observation regarding the expression patterns of different kinases.

In addition, it will be interesting to compare the expression analysis to the TCGA database, in order to see if there are any differences between normal tissues and cancer tissues.


Analysis

QC of the fastq files and choosing samples for analysis

I started with an initial fastQC analysis on the original files, and then presented it using MultiQC:

spack load fastqc

for f in databases/gtex_files/*/*/original/*
  do fastqc -o ./results/$(echo $f |cut -f 3-4 -d/)/fastqc_original $f --extract
done

multiqc ./Project/results/*/*/fastqc_original/* -o ./Project/results/combined/FastQC\ -\ Original/

As we can see, some of the samples have a significant adapter contamination and overrepresented sequences:

In addition, we can see that one sample (SRR600417 - lung), has relatively low quality score.

Moreover, some samples has high percentage of duplicated sequences:

Trimming adapters

Since the samples has adapter contamination, I trimmed the reads using TrimGalore:

spack load -r py-cutadapt
spack load -r trimgalore

for f in databases/gtex_files/*/*/original
  do trim_galore --illumina -o databases/gtex_files/$(echo $f |cut -f 3-4 -d/)/trimmed/ --paired $f/*
done

I chose the argument --length to remain 20bp (default). In addition, I didn’t specify the argument --paired so every read longer than 20bp after trimming will be preserved, even if its other paired-end is not preserved (since it is shorter than 20bp after trimming). That will allow us to use both paired-end and not paired-end alignment later on.

Then I ran FastQC again for the trimmed fastq files:

for f in databases/gtex_files/*/*/trimmed/*.gz
  do fastqc -o ~/angsd/project/results/$(echo $f |cut -f 3-4 -d/)/fastqc_trimmed $f --extract
done

multiqc ./Project/results/*/*/fastqc_trimmed/* -o ./Project/results/combined/FastQC\ -\ Trimmed/

As we can see, the overall quality got imporved. First, there are less duplicated sequences (mainly becuase many of them were adapters):

in addition, there are no overrepresented sequences and no adapter content.

Alignment

Since we are dealing here with RNA-seq, I used STAR. First, I indexed the reference genome (hg38) and then aligned the fastq files as paired-end:

spack load star@2.6.1

mkdir hg38_STAR_index
STAR --runMode genomeGenerate --runThreadN 8 --genomeDir gh38_STAR_index --genomeFastaFiles ./hg38.fa --sjdbGTFfile hg38_ucsc.gtf --sldbOverhang 75

for f in databases/gtex_files/*/*/trimmed
  do STAR --runMode alignReads --runThreadN 8 --genomeDir /athena/elementolab/scratch/toy2005/GenomeReference_hg38/hg38_STAR_index/ --readFilesIn $f/*.gz --readFilesCommand zcat --outFileNamePrefix /athena/elementolab/scratch/toy2005/angsd/project/databases/gtex_files/$(echo $f | cut -f 3-4 -d/)/star_aligned/$(echo $f | cut -f 4 -d/). --outSAMtype BAM SortedByCoordinate
  samtools stats databases/gtex_files/$(echo $f | cut -f 3-4 -d/)/star_aligned/$(echo $f | cut -f 4 -d/).Aligned.sortedByCoord.out.bam > ~/angsd/project/results/$(echo $f | cut -f 3-4 -d/)/alignment_qc/$(echo $f | cut -f 4 -d/).stats
  samtools flagstat databases/gtex_files/$(echo $f | cut -f 3-4 -d/)/star_aligned/$(echo $f | cut -f 4 -d/).Aligned.sortedByCoord.out.bam > ~/angsd/project/results/$(echo $f | cut -f 3-4 -d/)/alignment_qc/$(echo $f | cut -f 4 -d/).flagstat
done

Alignment QC:

spack load samtools@1.8

samtools stats databases/gtex_files/star_aligned/SRR807971/SRR807971_1.Aligned.sortedByCoord.out.bam > ~/angsd/project/results/SRR807971/alignment_qc/SRR807971_1.stats
samtools flagstat databases/gtex_files/star_aligned/SRR807971/SRR807971_1.Aligned.sortedByCoord.out.bam > ~/angsd/project/results/SRR807971/alignment_qc/SRR807971_1.flagstats

multiqc ./Project/results/*/*/alignment_qc/* -o ./Project/results/combined/Alignment\ QC/

As we can see, 100% of the reads were mapped:

Even though some of them where multi-mapped:

samples <- read.table("./results/combined/Alignment QC/samples_alignment.txt", sep = '\t', header = TRUE)
kable(samples) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
Sample.Name M.Reads.Mapped X..Mapped M.Non.Primary X..MapQ.0.Reads Error.rate X..Proper.Pairs
SRR1071717 44.7 100.00% 2.6 0.20% 0.00% 99.90%
SRR1079830 54.8 100.00% 2.9 0.20% 0.00% 99.90%
SRR1081765 45.5 100.00% 2.5 0.20% 0.00% 99.90%
SRR1084917 57.1 100.00% 4.6 0.30% 0.00% 99.90%
SRR1085402 55.8 100.00% 3.7 0.30% 0.00% 99.90%
SRR1086236 72.6 100.00% 4.1 0.20% 0.00% 99.90%
SRR1093861 71.6 100.00% 4.9 0.20% 0.00% 99.90%
SRR1095383 74.2 100.00% 5.8 0.20% 0.00% 99.90%
SRR1099957 61.4 100.00% 3.8 0.30% 0.00% 99.90%
SRR1120296 59.4 100.00% 3.6 0.30% 0.00% 99.90%
SRR595926 114.7 100.00% 8.0 0.20% 0.00% 99.70%
SRR598253 116.7 100.00% 7.2 0.20% 0.00% 99.70%
SRR600417 97.2 100.00% 6.5 0.30% 0.00% 99.20%
SRR600584 70.7 100.00% 3.8 0.20% 0.00% 99.70%
SRR600900 93.1 100.00% 5.6 0.20% 0.00% 99.80%
SRR601900 64.3 100.00% 3.6 0.30% 0.00% 99.70%
SRR603397 82.1 100.00% 5.1 0.20% 0.00% 99.80%
SRR613723 78.7 100.00% 31.7 0.50% 0.00% 99.70%
SRR614059 15.9 100.00% 4.3 0.80% 0.00% 99.60%
SRR614131 140.7 100.00% 55.4 0.50% 0.00% 99.80%
SRR614167 65.2 100.00% 20.4 0.60% 0.00% 99.80%
SRR614206 34.9 100.00% 2.6 0.30% 0.00% 99.70%
SRR614755 96.9 100.00% 5.8 0.30% 0.00% 99.80%
SRR614779 11.0 100.00% 0.8 0.30% 0.00% 99.70%
SRR615008 31.3 100.00% 2.1 0.30% 0.00% 99.70%
SRR615177 109.2 100.00% 36.0 0.70% 0.00% 99.70%
SRR615623 69.4 100.00% 4.4 0.30% 0.00% 99.80%
SRR627449 59.2 100.00% 9.2 0.40% 0.00% 99.10%
SRR627462 79.9 100.00% 25.2 0.60% 0.00% 99.10%
SRR655277 94.7 100.00% 5.6 0.20% 0.00% 99.80%
SRR656191 62.9 100.00% 4.3 0.30% 0.00% 99.70%
SRR656373 71.8 100.00% 4.3 0.20% 0.00% 99.80%
SRR656469 130.3 100.00% 8.7 0.30% 0.00% 99.70%
SRR656528 78.8 100.00% 5.5 0.40% 0.00% 99.60%
SRR659649 125.5 100.00% 8.2 0.20% 0.00% 99.80%
SRR665217 139.3 100.00% 8.6 0.30% 0.00% 99.70%
SRR807971 114.2 100.00% 9.3 0.30% 0.00% 99.80%
SRR815711 114.4 100.00% 9.3 0.20% 0.00% 99.90%

Gene count

spack load subread

for f in databases/gtex_files/*/*/star_aligned/*.Aligned.sortedByCoord.out.bam
  do featureCounts -T 2 -t gene -f -a databases/GenomeReference_hg38/annotation/Homo_sapiens.GRCh38.77.gtf -o databases/gtex_files/$(echo $f | cut -f 3-4 -d/)/counts/$(echo $f| cut -f 4 -d/)_counts.txt $f
done

I examined the number of mapped reads in every sample:

library(ggplot2)
library(reshape2)

file_names <- list.files(pattern = "\\.summary$", recursive = TRUE)
m <- matrix(0, ncol = 3, nrow = length(file_names))

gene_counts <- data.frame(m)
colnames(gene_counts) <- c('Assigned','Unassigned_NoFeatures','Unassigned_Ambiguity')
rownames(gene_counts) <- lapply(strsplit(file_names,'/'), function(x) paste(x[2:3], collapse = '/'))

for (f in file_names){
  sample <- paste(strsplit(f,'/')[[1]][2:3], collapse = '/')
  counts <- read.table(f, row.names = 1, header = TRUE)
  for (att in colnames(gene_counts)){
    gene_counts[sample,att] <- counts[att,]
  }
}

gene_counts$Category  <- row.names(gene_counts)
gene_counts_melt <- melt(gene_counts, value.name="Count", variable.name="Variable")
ggplot(data=gene_counts_melt, aes(x=Category, y=Count, fill=Variable)) + geom_bar(stat="identity", position=position_dodge()) + ggtitle("Reads Counts Per Gene - No Multi-Mapped Reads") + xlab("Sample") + ylab("Read Counts") + labs(fill = "Type") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As we can see in the figure above, most of the reads in every sample were assigned uniquely. However, there is a large variance in the amount of reads between samples, which can be a result of different sequencing depth. Therefore, I normalized the counts matrix using rlog and compare the differential expression.

library(DESeq2)

file_names <- list.files(pattern = "._counts.txt$", recursive = TRUE)
sample_matrix <- read.table(file_names[1], row.names = 1, header = TRUE)
m <- matrix(0, ncol = length(file_names), nrow = dim(sample_matrix)[1])

counts_matrix <- data.frame(m)
colnames(counts_matrix) <- lapply(strsplit(file_names,'/'), function(x) paste(x[2:3], collapse = '/'))
rownames(counts_matrix) <- rownames(sample_matrix)

# for (f in file_names){
#   sample <- paste(strsplit(f,'/')[[1]][2:3], collapse = '/')
#   sample_matrix <- read.table(f, row.names = 1, header = TRUE)
#   counts_matrix[,sample] <- sample_matrix[,ncol(sample_matrix)]
# }
# write.table(counts_matrix, file = './results/combined/Combined counts/counts_samples.txt', sep = '\t')
counts_matrix <- read.table('./results/combined/Combined counts/counts_samples.txt', sep = '\t')
colnames(counts_matrix) <- gsub('\\.','/',colnames(counts_matrix))

sample.info <- DataFrame(condition = sapply(strsplit(names(counts_matrix),'/'), function(x) x[1]), row.names = names(counts_matrix))
DESeq.ds <- DESeqDataSetFromMatrix(countData = counts_matrix, colData = sample.info, design = ~ condition)
# DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
# rlog.norm.counts <- assay(DESeq.rlog)

#write.table(rlog.norm.counts, file = './results/combined/Combined counts/rlog_counts_samples.txt', sep = '\t')
rlog.norm.counts <- read.table('./results/combined/Combined counts/rlog_counts_samples.txt', sep = '\t')
colnames(rlog.norm.counts) <- gsub('\\.','/',colnames(rlog.norm.counts))
rlog.norm.counts <- as.matrix(rlog.norm.counts)

ggplot(aes(x=value, colour=Var2), data=melt(rlog.norm.counts)) + geom_density() + ggtitle("Expression Distribution") + theme(plot.title = element_text(hjust = 0.5)) 

plot(hclust(dist(t(rlog.norm.counts))), labels = colnames(rlog.norm.counts), main = "rlog transformed read counts - Euclidean distance")

corr_coeff <- cor(rlog.norm.counts, method = "pearson")
plot(hclust(as.dist(1 - corr_coeff)), labels = colnames(rlog.norm.counts), main = "rlog transformed read counts - Correlation")

We can see that using euclidian distance and correlation, all the samples are clustered together, expcept for the bladder samples which are clustered in 2 groups of 4.

Exploratory Analysis

I ran PCA and Tsne plots over the gene expression of the samples, expecting to get similar samples clustered together (due to similar expression pattern).

library(Rtsne)
library(scatterplot3d)

rv <- rowVars(rlog.norm.counts)
top_variable <- order(rv, decreasing = TRUE)
pca <- prcomp(t(rlog.norm.counts))
pca.prop.var <- ((pca$sdev^2) / (sum(pca$sdev^2)))*100
df_pca_data <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], sample = sample.info$condition, condition = sample.info$condition)

ggplot(df_pca_data, aes(PC1,PC2, color = sample)) + geom_point(size=2) + labs(title = "PCA plot", x = paste0("PC1 (",pca.prop.var[1],")"), y = paste0("PC2 (",pca.prop.var[2],")"), colour="Tissue") + theme(plot.title = element_text(hjust = 0.5))

set.seed(1)
counts_tsne <- Rtsne(t(unique(rlog.norm.counts)), perplexity = 5, dims = 2, check_duplicates = FALSE)
ggplot(as.data.frame(counts_tsne$Y), aes(x = counts_tsne$Y[,1], y = counts_tsne$Y[,2], colour = sample.info$condition)) + geom_point(alpha = 1) + labs(title = "Tsne Plot", x = "Tsne 1", y = "Tsne 2", colour="Tissue") + theme(plot.title = element_text(hjust = 0.5))

As we can see, the different samples are clustered together in the PCA plot, and are even more clustered in the Tsne plot. Interestignly, also here the 8 bladder samples are clustered in 2 groups of 4.

Expression Analysis - SrcA Family

Finally, in order to test our hypothesis for a specific family, I compared the expression levels of each kinase within that family across tissues, and performed a two-tailed independent T-test for find statistically significant changes in the expression levels. Here, I am presenting the results for two specific families (SrcA and SrcB) in two tissues (liver and brain respectively), whereas the results for all the families across all tissues are provided in the Supplementary Information section.

For the SrcA family, according to the Tabula Muris data, we can see that FYN and YES1 are expressed in liver tissue, whereas SRC and FGR are not:

In an agreement with this data, we can see that in liver tissues from the GTEx database, FYN and YES1 are expressed in higher levels than FGR and SRC:

library(biomaRt)
library(reshape2)
library(ggsignif)
library(gridExtra)
library(gplots)

ensembl <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

kinases_family <- "SrcA Family"
kinases <- sort(c("SRC","FYN","YES1","FGR"))

kin_ensembl <- getBM(attributes = "ensembl_gene_id", filters="hgnc_symbol", values=kinases, mart=ensembl)
kin_ensembl <- kin_ensembl$ensembl_gene_id[unlist(lapply(kin_ensembl, function(x) x %in% rownames(rlog.norm.counts)))]

kin_exp <- rlog.norm.counts[kin_ensembl,]
kin_exp <- kin_exp[rowSums(kin_exp) > 0,]
rownames(kin_exp) <- kinases
colnames(kin_exp) <- sapply(strsplit(colnames(kin_exp),'/'), function(x) x[1])

tissues <- sort(unique(sapply(strsplit(colnames(DESeq.ds),'/'), function(x) x[1])))
sample_exp <- melt(kin_exp)[,c(2,1,3)]
colnames(sample_exp) <- c("tissue","kinase","exp_value")

t <- "liver"
tissue_exp <- sample_exp[sample_exp$tissue == t,]
plt <- ggplot(tissue_exp, aes(x=kinase, y=exp_value, fill=kinase)) + geom_boxplot() + geom_dotplot(binaxis='y', stackdir='center', binwidth = 0.3, dotsize=0.7) + ggtitle(t) + theme(plot.title = element_text(hjust = 0.5)) + geom_signif(comparisons = split(combn(kinases,2), rep(1:ncol(combn(kinases,2)), each = nrow(combn(kinases,2)))), map_signif_level=TRUE, step_increase = 0.1, vjust = 0.5, textsize = , margin_top = 0.1) + xlab("Kinase") + ylab("Expression Level (rlog)")
plot(plt)

For the SrcA family, we will plot a heatmap of the expression of the kianses across different tissues:

heatmap.2(t(kin_exp), Colv = FALSE, Rowv = FALSE, dendrogram = 'none', col = bluered(25), trace = 'column', density.info = 'density', cexCol = 1, srtCol = 45, cexRow = 1, rowsep=c(8,13,18,23,28,33), sepcolor="black", sepwidth=c(5,0.2), labRow = c(NA,NA,NA,"Bladder",NA,NA,NA,NA,NA,NA,"Blood",NA,NA,NA,NA,"Brain",NA,NA,NA,NA,"Liver",NA,NA,NA,NA,"Lung",NA,NA,NA,NA,"Skin",NA,NA,NA,NA,"Thyroid",NA,NA), tracecol = "black", key.title = NA, main = kinases_family, xlab = "Kinase", ylab = "Tissue")

As we can see, the kinases expression are mostly mutually exclusive, i.e. when a one or two kinaes are highly expressed in a tissue, the other kinases are lower expressed.

Expression Analysis - SrcB Family

For the SrcB family, according to the Tabula Muris data, we can see that HCK and LYN are expressed in brain cells, whereas BLK and LCK are not:

And also in that case, in consistency with this data, we can see that in brain samples from the GTEx database, HCK and LYN are expressed in higher levels than BLK and LCK:

kinases_family <- "SrcB Family"
kinases <- sort(c("LCK","HCK","BLK","LYN"))

kin_ensembl <- getBM(attributes = "ensembl_gene_id", filters="hgnc_symbol", values=kinases, mart=ensembl)
kin_ensembl <- kin_ensembl$ensembl_gene_id[unlist(lapply(kin_ensembl, function(x) x %in% rownames(rlog.norm.counts)))]

kin_exp <- rlog.norm.counts[kin_ensembl,]
kin_exp <- kin_exp[rowSums(kin_exp) > 0,]
rownames(kin_exp) <- kinases
colnames(kin_exp) <- sapply(strsplit(colnames(kin_exp),'/'), function(x) x[1])

tissues <- sort(unique(sapply(strsplit(colnames(DESeq.ds),'/'), function(x) x[1])))
sample_exp <- melt(kin_exp)[,c(2,1,3)]
colnames(sample_exp) <- c("tissue","kinase","exp_value")

t <- "brain"
tissue_exp <- sample_exp[sample_exp$tissue == t,]
plt <- ggplot(tissue_exp, aes(x=kinase, y=exp_value, fill=kinase)) + geom_boxplot() + geom_dotplot(binaxis='y', stackdir='center', binwidth = 0.3, dotsize=0.7) + ggtitle(t) + theme(plot.title = element_text(hjust = 0.5)) + geom_signif(comparisons = split(combn(kinases,2), rep(1:ncol(combn(kinases,2)), each = nrow(combn(kinases,2)))), map_signif_level=TRUE, step_increase = 0.1, vjust = 0.5, textsize = , margin_top = 0.1) + xlab("Kinase") + ylab("Expression Level (rlog)")
plot(plt)

For the SrcA family, we will plot a heatmap of the expression of the kianses across different tissues:

heatmap.2(t(kin_exp), symkey = FALSE, Colv = FALSE, Rowv = FALSE, dendrogram = 'none', col = bluered(25), trace = 'column', density.info = 'density', cexCol = 1, srtCol = 45, cexRow = 1, rowsep=c(8,13,18,23,28,33), sepcolor="black", sepwidth=c(5,0.2), labRow = c(NA,NA,NA,"Bladder",NA,NA,NA,NA,NA,NA,"Blood",NA,NA,NA,NA,"Brain",NA,NA,NA,NA,"Liver",NA,NA,NA,NA,"Lung",NA,NA,NA,NA,"Skin",NA,NA,NA,NA,"Thyroid",NA,NA), tracecol = "black", key.title = NA, main = kinases_family, xlab = "Kinase", ylab = "Tissue")

Also here, the kinases expression are mostly mutually exclusive.


Discussion

First and foremost, in order for the analysis to be complete, we need to test those expression patterns across all the available tissues, using all the available samples. In addition, during the analysis I have encountered some issues I had to address:

  1. Sample size: in order to test this hypothesis, I had to compare the expression levels of different kinases in a tissue, and therefore i had to use many samples for performing a statistical comparison. I used 5 samples per tissue, but in the future all the samples will have to be used.

  2. Kinase expression level: in our cells, there are sometimes many different isoforms for the same kinase. In order to obtain the expression level of the kianse on average, I used the feature “gene” for counting the features. Then, I considered several different normalization methods (TPM, SF and rlog), and evetually I picked the rlog normalization since the library size (sequencing depth etc.) is highly varied between samples and tissues.

  3. Statistical test: usually, people are using RNA-seq data to compare the same genes between different samples. In that case, I tried to compare different genes across one sample type at a time. Therefore, once i got the expression pattern, i had to use a statistical test for finding the differential expression. After trying some widely used tests (T-test, wilcoxon test, ANOVA), I saw that there are not much difference between them, and I decided to use the classical T-test for the statistical significance.

For the full results of all the kinases families tested, please refer to the Supplementary Information section.


Supplementary Information

all_families <- list("DDR Family","SrcA Family","SrcB Family","Type III RTK Family","TAM Family","INSR/IGF1R Receptor Family","Eph Family")
family_kinases <- list(sort(c("DDR1","DDR2")), sort(c("SRC","FYN","YES1","FGR")), sort(c("LCK","HCK","BLK","LYN")), sort(c("CSF1R","KIT")), sort(c("TYRO3","AXL","MERTK")), sort(c("INSR","IGF1R")), sort(c(paste0("EphA",1:8), paste0("EphB",c(1,2,4)))))

# kinases_family <- "DDR Family"
# kinases <- sort(c("DDR1","DDR2"))
# kinases_family <- "SrcA Family"
# kinases <- sort(c("SRC","FYN","YES1","FGR"))
# kinases_family <- "SrcB Family"
# kinases <- sort(c("LCK","HCK","BLK","LYN"))
# kinases_family <- "Type III RTK Family"
# kinases <- sort(c("CSF1R","KIT"))
# kinases_family <- "TAM Family"
# kinases <- sort(c("TYRO3","AXL","MERTK"))
# kinases_family <- "INSR/IGF1R Receptor Family"
# kinases <- sort(c("INSR","IGF1R"))
# kinases_family <- "Eph Family"
# kinases <- sort(c(paste0("EphA",1:8), paste0("EphB",c(1,2,4))))

for (kin_fam in mapply(list, all_families, family_kinases, SIMPLIFY=F)){
  
  kinases_family <- kin_fam[[1]]
  kinases <- kin_fam[[2]]
  
  print(kinases_family)
  
  kin_ensembl <- getBM(attributes = "ensembl_gene_id", filters="hgnc_symbol", values=kinases, mart=ensembl)
  kin_ensembl <- kin_ensembl$ensembl_gene_id[unlist(lapply(kin_ensembl, function(x) x %in% rownames(rlog.norm.counts)))]
  
  kin_exp <- rlog.norm.counts[kin_ensembl,]
  kin_exp <- kin_exp[rowSums(kin_exp) > 0,]
  rownames(kin_exp) <- kinases
  colnames(kin_exp) <- sapply(strsplit(colnames(kin_exp),'/'), function(x) x[1])
  
  tissues <- sort(unique(sapply(strsplit(colnames(DESeq.ds),'/'), function(x) x[1])))
  sample_exp <- melt(kin_exp)[,c(2,1,3)]
  colnames(sample_exp) <- c("tissue","kinase","exp_value")
  
  plots <- list()
  
  for (i in 1:length(tissues)){
    t <- tissues[i]
    tissue_exp <- sample_exp[sample_exp$tissue == t,]
    # plots[[i]] <- ggplot(tissue_exp, aes(x=kinase, y=exp_value, fill=kinase)) + geom_boxplot() + geom_dotplot(binaxis='y', stackdir='center', binwidth = 0.3) + ggtitle(t) + theme(plot.title = element_text(hjust = 0.5)) + geom_signif(comparisons = split(combn(kinases,2), rep(1:ncol(combn(kinases,2)), each = nrow(combn(kinases,2)))), map_signif_level=TRUE, step_increase = 0.1, vjust = 0.5, textsize = , margin_top = 0.1) + xlab("Kinase") + ylab("Expression Level (rlog)")
    plots[[i]] <- ggplot(tissue_exp, aes(x=kinase, y=exp_value, fill=kinase)) + geom_boxplot() + geom_dotplot(binaxis='y', stackdir='center', binwidth = 0.3, dotsize = 0.5) + ggtitle(t) + theme(plot.title = element_text(hjust = 0.5)) +  xlab("Kinase") + ylab("Expression Level (rlog)")
    # plot(plots[[i]])
  }
  
  # pdf(paste0('./figures/', paste0(kinases, sep = "_", collapse = ""), 'exp_levels.pdf'), width = 10, height = 5, paper = 'special')
  do.call(grid.arrange,plots)
  # dev.off()
  
  plt <- ggplot(sample_exp, aes(x=tissue, y=exp_value, fill=kinase)) + geom_boxplot() + ggtitle(kinases_family) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Tissue") + ylab("Expression Level (rlog)") + geom_vline(xintercept = (1:6 + 0.5), linetype = "dashed")
  plot(plt)
  
  heatmap.2(t(kin_exp), Colv = FALSE, Rowv = FALSE, dendrogram = 'none', col = bluered(25), trace = 'column', density.info = 'density', symkey = FALSE, cexCol = 1, srtCol = 45, cexRow = 1, rowsep=c(8,13,18,23,28,33), sepcolor="black", sepwidth=c(5,0.2), labRow = c(NA,NA,NA,"Bladder",NA,NA,NA,NA,NA,NA,"Blood",NA,NA,NA,NA,"Brain",NA,NA,NA,NA,"Liver",NA,NA,NA,NA,"Lung",NA,NA,NA,NA,"Skin",NA,NA,NA,NA,"Thyroid",NA,NA), tracecol = "black", key.title = NA, main = kinases_family, xlab = "Kinase", ylab = "Tissue", symbreaks = FALSE)
}
## [1] "DDR Family"

## [1] "SrcA Family"

## [1] "SrcB Family"

## [1] "Type III RTK Family"

## [1] "TAM Family"

## [1] "INSR/IGF1R Receptor Family"

## [1] "Eph Family"


References

Blume-Jensen, Peter, and Tony Hunter. 2001. “Oncogenic Kinase Signalling.” Nature 411 (6835). Nature Publishing Group: 355.

Consortium, Tabula Muris, and others. 2018. “Single-Cell Transcriptomics of 20 Mouse Organs Creates a Tabula Muris.” Nature 562 (7727): 367.

Easty, DJ, and DC Bennett. 2000. “Protein Tyrosine Kinases in Malignant Melanoma.” Melanoma Research 10 (5). LWW: 401–11.

Evan, Gerard I, and Karen H Vousden. 2001. “Proliferation, Cell Cycle and Apoptosis in Cancer.” Nature 411 (6835). Nature Publishing Group: 342.

Hutti, Jessica E, Emily T Jarrell, James D Chang, Derek W Abbott, Peter Storz, Alex Toker, Lewis C Cantley, and Benjamin E Turk. 2004. “A Rapid Method for Determining Protein Kinase Phosphorylation Specificity.” Nature Methods 1 (1). Nature Publishing Group: 27.

Turk, Benjamin E, Lisa L Huang, Elizabeth T Piro, and Lewis C Cantley. 2001. “Determination of Protease Cleavage Site Motifs Using Mixture-Based Oriented Peptide Libraries.” Nature Biotechnology 19 (7). Nature Publishing Group: 661.